R examplesThe data handling and manipulation techniques explained will be illustrated by means of a data set of 2000 world leading companies, the Forbes 2000 list for the year 2004 collected by Forbes Magazine. This list is originally available from www.forbes.com
Here we show a subset of the data set:
| rank | name | country | category | sales | profits | assets | marketvalue |
|---|---|---|---|---|---|---|---|
| 1 | Citigroup | United States | Banking | 94.71 | 17.85 | 1264.03 | 255.30 |
| 2 | General Electric | United States | Conglomerates | 134.19 | 15.59 | 626.93 | 328.54 |
| 3 | American Intl Group | United States | Insurance | 76.66 | 6.46 | 647.66 | 194.87 |
| 4 | ExxonMobil | United States | Oil & gas operations | 222.88 | 20.96 | 166.99 | 277.02 |
| 5 | BP | United Kingdom | Oil & gas operations | 232.57 | 10.27 | 177.57 | 173.54 |
| 6 | Bank of America | United States | Banking | 49.01 | 10.81 | 736.45 | 117.55 |
| 7 | HSBC Group | United Kingdom | Banking | 44.33 | 6.66 | 757.60 | 177.96 |
| 8 | Toyota Motor | Japan | Consumer durables | 135.82 | 7.99 | 171.71 | 115.40 |
| 9 | Fannie Mae | United States | Diversified financials | 53.13 | 6.48 | 1019.17 | 76.84 |
| 10 | Wal-Mart Stores | United States | Retailing | 256.33 | 9.05 | 104.91 | 243.74 |
| 11 | UBS | Switzerland | Diversified financials | 48.95 | 5.15 | 853.23 | 85.07 |
| 12 | ING Group | Netherlands | Diversified financials | 94.72 | 4.73 | 752.49 | 54.59 |
| 13 | Royal Dutch/Shell Group | Netherlands/ United Kingdom | Oil & gas operations | 133.50 | 8.40 | 100.72 | 163.45 |
| 14 | Berkshire Hathaway | United States | Insurance | 56.22 | 6.95 | 172.24 | 141.14 |
| 15 | JP Morgan Chase | United States | Banking | 44.39 | 4.47 | 792.70 | 81.94 |
| 16 | IBM | United States | Technology hardware & equipment | 89.13 | 7.58 | 104.46 | 171.54 |
| 17 | Total | France | Oil & gas operations | 131.64 | 8.84 | 87.84 | 116.64 |
| 18 | BNP Paribas | France | Banking | 47.74 | 4.73 | 745.09 | 59.29 |
| 19 | Royal Bank of Scotland | United Kingdom | Banking | 35.65 | 4.95 | 663.45 | 90.21 |
| 20 | Freddie Mac | United States | Diversified financials | 46.26 | 10.09 | 752.25 | 44.25 |
The data consists of 2000 observations on the following 8 variables.
rank: the ranking of the company.name: the name of the company.country: a factor giving the country the company is situated in.category: a factor describing the products the company produces.sales: the amount of sales of the company in billion USD.profits: the profit of the company in billion USD.assets: the assets of the company in billion USD.marketvalue: the market value of the company in billion USD.R output
## 'data.frame': 2000 obs. of 8 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr "Citigroup" "General Electric" "American Intl Group" "ExxonMobil" ...
## $ country : Factor w/ 61 levels "Africa","Australia",..: 60 60 60 60 56 60 56 28 60 60 ...
## $ category : Factor w/ 27 levels "Aerospace & defense",..: 2 6 16 19 19 2 2 8 9 20 ...
## $ sales : num 94.7 134.2 76.7 222.9 232.6 ...
## $ profits : num 17.85 15.59 6.46 20.96 10.27 ...
## $ assets : num 1264 627 648 167 178 ...
## $ marketvalue: num 255 329 195 277 174 ...
Nominal measurements are represented by factor variables in R, such as the country of the company or the category of the business segment.
A factor in R is divided into levels
How many countries are on the top 2000 ranking?
R output
## [1] 61
Which countries?
R output
## [1] "Africa" "Australia"
## [3] "Australia/ United Kingdom" "Austria"
## [5] "Bahamas" "Belgium"
## [7] "Bermuda" "Brazil"
## [9] "Canada" "Cayman Islands"
## [11] "Chile" "China"
## [13] "Czech Republic" "Denmark"
## [15] "Finland" "France"
## [17] "France/ United Kingdom" "Germany"
## [19] "Greece" "Hong Kong/China"
## [21] "Hungary" "India"
## [23] "Indonesia" "Ireland"
## [25] "Islands" "Israel"
## [27] "Italy" "Japan"
## [29] "Jordan" "Kong/China"
## [31] "Korea" "Liberia"
## [33] "Luxembourg" "Malaysia"
## [35] "Mexico" "Netherlands"
## [37] "Netherlands/ United Kingdom" "New Zealand"
## [39] "Norway" "Pakistan"
## [41] "Panama/ United Kingdom" "Peru"
## [43] "Philippines" "Poland"
## [45] "Portugal" "Russia"
## [47] "Singapore" "South Africa"
## [49] "South Korea" "Spain"
## [51] "Sweden" "Switzerland"
## [53] "Taiwan" "Thailand"
## [55] "Turkey" "United Kingdom"
## [57] "United Kingdom/ Australia" "United Kingdom/ Netherlands"
## [59] "United Kingdom/ South Africa" "United States"
## [61] "Venezuela"
And in the top 20?
R output
## [1] "France" "Japan"
## [3] "Netherlands" "Netherlands/ United Kingdom"
## [5] "Switzerland" "United Kingdom"
## [7] "United States"
As a simple summary statistic, the frequencies of the levels of such a factor variable can be found from
##
## France Japan
## 2 1
## Netherlands Netherlands/ United Kingdom
## 1 1
## Switzerland United Kingdom
## 1 3
## United States
## 11
Which type of companies?
## [1] "Aerospace & defense" "Banking"
## [3] "Business services & supplies" "Capital goods"
## [5] "Chemicals" "Conglomerates"
## [7] "Construction" "Consumer durables"
## [9] "Diversified financials" "Drugs & biotechnology"
## [11] "Food drink & tobacco" "Food markets"
## [13] "Health care equipment & services" "Hotels restaurants & leisure"
## [15] "Household & personal products" "Insurance"
## [17] "Materials" "Media"
## [19] "Oil & gas operations" "Retailing"
## [21] "Semiconductors" "Software & services"
## [23] "Technology hardware & equipment" "Telecommunications services"
## [25] "Trading companies" "Transportation"
## [27] "Utilities"
How many of each category?
##
## Aerospace & defense Banking
## 19 313
## Business services & supplies Capital goods
## 70 53
## Chemicals Conglomerates
## 50 31
## Construction Consumer durables
## 79 74
## Diversified financials Drugs & biotechnology
## 158 45
## Food drink & tobacco Food markets
## 83 33
## Health care equipment & services Hotels restaurants & leisure
## 65 37
## Household & personal products Insurance
## 44 112
## Materials Media
## 97 61
## Oil & gas operations Retailing
## 90 88
## Semiconductors Software & services
## 26 31
## Technology hardware & equipment Telecommunications services
## 59 67
## Trading companies Transportation
## 25 80
## Utilities
## 110
A simple summary statistics such as the mean, median, quantiles and range can be found from continuous variables such as sales
R output
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.010 2.018 4.365 9.697 9.548 256.300
Chambers et al. (1983), “there is no statistical tool that is as powerful as a well chosen graph”
Histograms and boxplots
Scatterplots to visualize the relationship between variables
Using the ggplot2 library, created by Hadley Wickham.
Boxplots of the logarithms of the market value for four selected countries, the width of the boxes is proportional to the square roots of the number of companies.
Scatterplots by country
Questions:
Fisher and Belle (1993) report mortality rates due to malignant melanoma of the skin for white males during the period 1950-1969, for each state on the US mainland.
| mortality | latitude | longitude | ocean | |
|---|---|---|---|---|
| Alabama | 219 | 33.0 | 87.0 | yes |
| Arizona | 160 | 34.5 | 112.0 | no |
| Arkansas | 170 | 35.0 | 92.5 | no |
| California | 182 | 37.5 | 119.5 | yes |
| Colorado | 149 | 39.0 | 105.5 | no |
| Connecticut | 159 | 41.8 | 72.8 | yes |
| Delaware | 200 | 39.0 | 75.5 | yes |
| District of Columbia | 177 | 39.0 | 77.0 | no |
| Florida | 197 | 28.0 | 82.0 | yes |
| Georgia | 214 | 33.0 | 83.5 | yes |
| Idaho | 116 | 44.5 | 114.0 | no |
| Illinois | 124 | 40.0 | 89.5 | no |
| Indiana | 128 | 40.2 | 86.2 | no |
| Iowa | 128 | 42.2 | 93.8 | no |
| Kansas | 166 | 38.5 | 98.5 | no |
| Kentucky | 147 | 37.8 | 85.0 | no |
| Louisiana | 190 | 31.2 | 91.8 | yes |
| Maine | 117 | 45.2 | 69.0 | yes |
| Maryland | 162 | 39.0 | 76.5 | yes |
| Massachusetts | 143 | 42.2 | 71.8 | yes |
| Michigan | 117 | 43.5 | 84.5 | no |
| Minnesota | 116 | 46.0 | 94.5 | no |
| Mississippi | 207 | 32.8 | 90.0 | yes |
| Missouri | 131 | 38.5 | 92.0 | no |
| Montana | 109 | 47.0 | 110.5 | no |
| Nebraska | 122 | 41.5 | 99.5 | no |
| Nevada | 191 | 39.0 | 117.0 | no |
| New Hampshire | 129 | 43.8 | 71.5 | yes |
| New Jersey | 159 | 40.2 | 74.5 | yes |
| New Mexico | 141 | 35.0 | 106.0 | no |
| New York | 152 | 43.0 | 75.5 | yes |
| North Carolina | 199 | 35.5 | 79.5 | yes |
| North Dakota | 115 | 47.5 | 100.5 | no |
| Ohio | 131 | 40.2 | 82.8 | no |
| Oklahoma | 182 | 35.5 | 97.2 | no |
| Oregon | 136 | 44.0 | 120.5 | yes |
| Pennsylvania | 132 | 40.8 | 77.8 | no |
| Rhode Island | 137 | 41.8 | 71.5 | yes |
| South Carolina | 178 | 33.8 | 81.0 | yes |
| South Dakota | 86 | 44.8 | 100.0 | no |
| Tennessee | 186 | 36.0 | 86.2 | no |
| Texas | 229 | 31.5 | 98.0 | yes |
| Utah | 142 | 39.5 | 111.5 | no |
| Vermont | 153 | 44.0 | 72.5 | yes |
| Virginia | 166 | 37.5 | 78.5 | yes |
| Washington | 117 | 47.5 | 121.0 | yes |
| West Virginia | 136 | 38.8 | 80.8 | no |
| Wisconsin | 110 | 44.5 | 90.2 | no |
| Wyoming | 134 | 43.0 | 107.5 | no |
A data consists of 48 observations on the following 5 variables.
mortality: number of white males died due to malignant melanoma 1950-1969 per one million inhabitants.
latitude: latitude of the geographic centre of the state.
longitude: longitude of the geographic centre of each state.
ocean: a binary variable indicating contiguity to an ocean at levels no or yes.
Let us plot mortality rates in
Summary statistic of Mortality:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 86.0 128.0 147.0 152.9 178.0 229.0
Malignant melanoma mortality rates by contiguity to an ocean
Histograms can often be misleading for displaying distributions because of their dependence on the number of classes chosen. An alternative is to formally estimate the density function of a variable and then plot the resulting estimate.
The estimated densities of malignant melanoma mortality rates by contiguity to an ocean looks like this:
Now we might move on to look at how mortality rates are related to the geographic location of a state as represented by the latitude and longitude of the centre of the state.
The data contains the longitude and latitude of the centroids
We can use several R packages for mapping e.g.: library(maps) or library(maptools)
Mathematical functions
Three time series giving the monthly deaths from bronchitis, emphysema and asthma in the UK, 1974-1979, both sexes (ldeaths), males (mdeaths) and females (fdeaths).